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Abstract 

We present results for application of block BiCGSTAB algorithm modified by the QR decomposition and the SAP 
preconditioner to the Wilson-Dirac equation with multiple right-hand sides in lattice QCD on a 32 3 x 64 lattice at 
almost physical quark masses. The QR decomposition improves convergence behaviors in the block BiCGSTAB 
algorithm suppressing deviation between true residual and recursive one. The SAP preconditioner applied to the 
domain-decomposed lattice helps us minimize communication overhead. We find remarkable cost reduction thanks 
to cache tuning and reduction of number of iterations. 

Keywords: Lattice gauge theory, Lattice Dirac equation, Multiple right-hand sides, Block Krylov subspace, Domain 
decomposition 



1. Introduction 

Lattice QCD simulations initiated 30 years ago stand 
finally at the point where one can obtain results of phys- 
ical observables at the physical up, down and strange 
quark masses Jl|]. The next steps would be refinement 
of the results reducing the systematic errors and chal- 
lenge to computationally difficult problems, e.g. calcu- 
lation of disconnected diagrams. A main difficulty in 
lattice QCD simulations is that solution of Dirac equa- 
tion, which have to be repeated many times both in 
configuration generation and measurement of physical 
observables on given configurations, is computationally 
expensive near the physical up and down quark masses. 
In the measurement of physical observables, however, 
computational cost may be reduced by block Krylov 
subspace methods yfl, since its expensive part is the 
multiple right-hand side problem. (This is not the case 
for configuration generation.) One can expect that block 
Krylov subspace methods make convergence faster with 
the aid of better search vectors generated from wider 
Krylov subspace enlarged by number of right-hand side 
vectors in comparison with non-blocked method. An- 
other possible ingredient to improve performance is an 
efficient use of memory bandwidth in implementation 
of block matrix-vector multiplication. 



Since the Dirac matrix in lattice QCD is non- 
Hermitian, we might expect the block BiCGSTAB al- 
gorithm J3| is applicable in a straightforward way. One 
problem in block Krylov subspace methods, however, 
is that the true residual stops decreasing at some point, 
while the recursive one continues to decrease. Recently, 
three of the authors have proposed a new algorithm 
named block BiCGGR, which showed significant im- 
provement for this problem El H ■ 

In this paper we improve block BiCGSTAB algo- 
rithm with two modifications. First one is the QR 
decomposition, which is known to improve the nu- 
merical accuracy in block CG J2, 01 and also use- 
ful for block BiCGSTAB algorithm H . Second one 
is Schwarz alternating procedure (SAP) preconditioner 
proposed by Luscher H, which is applied to the 
domain-decomposed lattice. We can minimize commu- 
nication overhead with the SAP preconditioner. 

This paper is organized as follows. In Sec. [2] we 
explain the algorithmic details of the modified block 
BiCGSTAB with SAP preconditioner. We present the 
results of the numerical test in Sec. [3] Conclusions and 
discussions are summarized in Sec. [4] 
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2. Algorithm 

2.1. Modified Block BiCGSTAB 

We consider to solve the linear systems with the mul- 
tiple right-hand sides expressed as 



AX = B, 



(1) 



where A is an N x N complex sparse non-Hermitian ma- 
trix. X and B are N x L complex rectangular matrices 
given by 



X = (x m ,...,x«\...,x (L) ), 
B = (b (l \...,b (i \...,b <L) ). 



(2) 
(3) 



In the case of the Wilson-Dirac equation the matrix di- 
mension is given by N = L x x L v X L z X L t X 3 X 4 
with L x x L Y x L : x L, the volume of a hypercubic four- 
dimensional lattice. L is the number of the right-hand- 
side vectors which is called source vectors in lattice 
QCD. L is 12 in the simplest case and O(10) - 0(100) 
(perhaps more in some case) for the stochastic method. 

The matrix-vector multiplication for the Wilson- 
Dirac equation is written as 



L,xL v y.L-xL, 
a=1 



(4) 



Tlx = 2 [(! " YrWxj&x* + (1 + 7,)Ul^,-f] , (5) 

where (f> x and rj x contain 12 complex numbers at site x, 
jft are the gamma matrices, U X fi are link variables of 
SU(3) matrix and k is hopping parameter. Computa- 
tion of r] x requires 132C0 floating point number opera- 
tions and 360 words per site. This means the value of 
Flops/Byte is about 0.9 (0.45) in the single (double) pre- 
cision. It should be difficult to obtain high performance 
on recent computer architecture. 

In block Krylov subspace methods, Eq. (f5J) can be 
expressed as 

r§ = J [(1 - 7^4,% + (1 + YjUl^] , 



(6) 



with ; = 1 , . . . , L. An important point is that same 8 link 
variables around site x are used in common for </> w with 
; = 1, . . . ,L and their size is just 576 (1 152) bytes in the 
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Figure 1: Flops/Byte as a function of L for standard (black circle) 
and 0(a)-improved (red square) Wilson-Dirac operators in the single 
precision. 



single (double) precision, which are small enough to be 
retained in low level cache, for example LI cache (if 
there is). This allows us more efficient usage of cached 
data than repeating L independent matrix-vector multi- 
plications. Figure[TJillustrates how Flops/Byte increases 
as L is enlarged. For an effective use of cache, we ar- 
range loop for the index of i (i = 1, . . . , L) in the most 
inner level with ; running first in memory allocation for 
vectors. 

Pseudocode for modified block BiCGSTAB algo- 
rithm is described in Algorithm 12.11 Note that precon- 
ditioner M at lines 4.2 and 4.6 in Algorithm 12. 1 1 must 
be implemented by a stationary iterative method in this 
algorithm since the common preconditioning should be 
applied to all right-hand sides, though nonstational it- 
erative methods are often used in lattice QCD ifioll . 
The orthogonalization of P improves numerical accu- 
racy since each span works effectively to search approx- 
imated solutions. We employ modified Gram-Schmidt 
method for the QR decomposition. Even when non- 
block BiCGSTAB fails to converge, modified block 
BiCGSTAB may converge by adding dummy right-hand 
side vectors if they can play a supplementary role 
We also present a memory saving version in Algo- 
rithm O 



2.2. Preconditioning 

In this work we employ the <9(a)-improved@ Wilson 
fermions. After Jacobi preconditioning (division of / + 



1 1296 flops if is non-relativistic representation. 



2 The leading cut-off error in terms of the lattice spacing a is re- 
moved. 
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Algorithm 2.1: Modified Block BiCGSTAB(A, M, B, e) 

1 initial guess X e C NxL 

2 R = B - AX 

3 P = R 

4 choose R e C NxL 
while max,(|r ( '' , |/l* (, ' ) l) < e 

4.1 QR decomposition P = Qy, P <- Q 

4.2 U = MP 

4.3 V = A{/ 

4.4 solve (fl"V)o< = R"R for a 

4.5 r = R - Va 

4.6 S =MT 

4.7 Z = AS 

4.8 i = Ti(Z»T k )/Tr(Z»Z k ) 

4.9 X <^ X + Ua + £S 

4.10 R = T - £Z 

4.11 solve (R H V)B = -R H Z for B 

4.12 P^R + (P-(V)B 

5 return (X) 



do 



Algorithm 2.2: Memory Saving Version(A, M, B, e) 

1 initial guess X e C NxL 

2 R = B - AX 

3 P = R 

4 choose R e C NxL 
while max,(|r (0 |/l& (0 l) < e 

4. 1 QR decomposition P = Qy, P <- g 

4.2 U = MP 

4.3 V = A(/ 

4.4 solve V)» = for a 

4.5 R^ R-Va 

4.6 X «- X + Ua 
do 4.7 5 = MR 

4.8 Z = A5 

4.9 £ = Tr(Zf ^)/Tr(ZfZ,) 

4.10 X <- X + ^S 

4.11 R<-R-(Z 

4.12 solve (£ H V)j8 = -fi"Z foryS 

4.13 P<-R + (P-(V)B 

5 return (X) 



clover term), the matrix A is decomposed as the follow- 
ing 2x2 blocked matrix form, 



Aee Aeo 

A-OE Ago 



(7) 



where the subscript E and O denote the even and odd 
domains, respectively. The SAP preconditioner M$ap is 
computed as 



N S AP 

M sap =kJ](1-AK) 



K = 



Bee 
-BqoAoeBee 




Boo 



(8) 



where Bee (Boo) is an ap prox imation for A EE (A l ) 
obtained by SSOR method 111 



Bee = (l-ojU EE r ] [ J (1 - A s E s E 0R y](l - ojLeeT 1 , 

(9) 

with 



1 ££ 



cdLee) 1 +(1 -ojUee) 1 



+ (<u - 2)(1 - wL ££ )-'(l - wt/^r 1 ] 



(10) 



is the forward hopping term and Uee is the back- 
ward one. We perform SAP preconditioning in the sin- 
gle precision for effective use of memory bandwidth and 
network bandwidth between nodes. 

It is known that "sloppy" precision can be used in 
right preconditioning, but not in left one. Suppose cal- 
culation of S — MT at line 4.6 in Algorithm |2T| is per- 
formed with "sloppy" precision in k-th iteration. Nu- 
merical errors for S k , Z k , £ k and X k+ i may be expressed 
as 



S k 

z k 

ft 

Xk+i 
These yield 

R 



S' k - S k + 6Sk, 
Z' k =AS' k , 

& = & + <% , 

X' k+i = X k + U k a k + £ k S' k , 



(11) 
(12) 
(13) 
(14) 



n - Rk - V k a k - £ k Z' k 
= R k -AU k a k -? k AS' k 
= B-AX k -A(U k a k + (' k S' k ) 
= B-AX' k+1 , 



(15) 
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which satisfies the exact relation between approximate 
solutions and residuals in (k + l)-th iteration. For the 
case that both U = MP at line 4.2 and S = MT at line 
4.6 are computed with "sloppy" precision one can also 
reproduce the above relation with the following formu- 
lae: 





-> U' k = u k + su k , 


(16) 


v k 


-» V' k = AU' k , 


(17) 


ait 


-> a' k = a k + 5a , 


(18) 


T k 


-» T' k =R k -V' k a' k , 


(19) 


s k 


— > S k = S k + 5S , 


(20) 


z k 




(21) 


ft 


-» + 


(22) 


X/c+l 




(23) 
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Figure 2: Representative case for residual norm as a function of num- 
ber of iteration with L = 1, 2, 3, 4, 6, 12. 



3. Numerical test 

3.1. Choice of parameters 

We test modified block BiCGSTAB employing a so- 
called "local source", B = [e\,...,e£\, with L — 12 for 
color-spin components. We use statistically indepen- 
dent 10 configurations generated at almost the physical 
point, K ud = 0.137785 and k s = 0.136600, in 2+1 flavor 
lattice QCD with the nonperturbatively <9(a)-improved 
Wilson quark action and the Iwasaki gauge action IU2I1 



we modify the lattice QCD simulation program LD- 
DHMC/verl.04bl2.31 developed by PACS-CS Collab- 
oration 111311 . 



at B = 1.9 on a 32 3 x 64 lattice HI. We choose the 
hopping parameter k = 0.137785 for the Wilson-Dirac 
equation and Nsap = 5 with 8x8x8x8 domain size 
for the SAP preconditioning following Ref. QJJ] . Param- 
eters for SSOR method are also fixed with Nssor = 1 
and <d = 1.26. The stopping criterion is set to be 
max,-(|r®|/|A®|) < e with e = 10~ 14 . 

3.2. Test environment 

Numerical test is performed on 16 nodes of a 
large-scale cluster system called T2K-Tsukuba. The 
machine consists of 648 compute nodes providing 
95.4Tflops of computing capability. Each node con- 
sists of quad-socket, 2.3GHz Quad-Core AMD Opteron 
Model 8356 processors whose on-chip cache sizes are 
64KByt.es/core, 512KBytes/core, 2MB/chip for LI, L2, 
L3, respectively. Each processor has a direct connect 
memory interface to an 8GBytes DDR2-667 memory 
and three hypertransport links to connect other pro- 
cessors. All the nodes in the system are connected 
through a full-bisectional fat-tree network consisting 
of four interconnection links of 8GBytes/sec aggre- 
gate bandwidth with Infiniband. For numerical test 



3.3. Results 

Figure [2] shows a representative case for residual 
norm as a function of number of iterations for modified 
block BiCGSTAB. We observe one of important fea- 
tures of block Krylov subspace methods that the num- 
ber of iterations required for convergence decreases as 
the block size L is increased. 



Lx 12/L 


time[s] 


T(gain) 
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5.4 
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Table 1 : L dependence for time, gain factor of time, number of matrix- 
vector multiplication and its gain factor. Central values are given for 
gain factors. 

The results are summarized in Table [TJ Second col- 
umn is total time to solve the Wilson-Dirac equation for 
all 12 colour-spin components at one local source. In 
case of L = 6, for example, 12 right-hand side vec- 
tors are divided into two blocks: Bi = [e\,...,e^\ and 
Z?2 = [e-j, ...,ei2]. Third column is gain factor of time 
compared with L = 1 case. Fourth and fifth columns are 
number of matrix-vector multiplication (NMVM) and 
its gain factor, respectively. Modified block BiCGSTAB 
with L = 12 is about 5 times faster than L = 1 case. The 
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iteration number is reduced by a factor of three. Addi- 
tional speed-up by a factor of two is obtained by cache 
tuning. This is roughly consistent with the enhancement 
of Flops/Byte from 1.05 at L = 1 to 1.95 at L = 12 plot- 
ted in Fig.Q] 

4. Conclusions 

In this paper, we have carried out numerical test for 
block BiCGSTAB with two modifications of the QR 
decomposition and the SAP preconditioner in lattice 
QCD at almost physical quark masses. The QR de- 
composition successfully removes the problem in block 
BiCGSTAB that is the deviation between the true and 
the recursive residuals. We find remarkable cost reduc- 
tion at large L due to smaller number of iterations and 
efficient cache usage. Further gain could be expected in 
calculations of disconnected diagram and reweighting 
factor, where larger value of L is required. One con- 
cern is that numerical cost for modified Gram-Schmidt 
method increases as 0(L 2 ). 
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